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Biological sequences may contain patterns that are signal impor- 
tant biomolecular functions; a classical example is regulation of gene 
expression by transcription factors that bind to specific patterns in 
genomic promoter regions. In motif discovery we are given a set of 
O^l . sequences that share a common motif and aim to identify not only 

the motif composition, but also the binding sites in each sequence 
of the set. We present a Bayesian model that is an extended version 
of the model adopted by the Gibbs motif sampler, and propose a 
' new centroid estimator that arises from a refined and meaningful loss 

function for binding site inference. We discuss the main advantages 
■ of centroid estimation for motif discovery, including computational 

' convenience, and how its principled derivation offers further insights 

about the posterior distribution of binding site configurations. We 
also illustrate, using simulated and real datasets, that the centroid 
Qh ' estimator can differ from the maximum a posteriori estimator. 
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1. Introduction. In motif discovery we are given a set of sequences that share a com- 
mon motif and aim to identify the motif composition — the frequency of symbols for each 
position in the pattern — and the positions in each sequence where the motifs are. It is as- 
sumed that the motifs are significantly different, in composition, from sequence background. 
This problem has gained attention and relevance in the past 25 years mainly due to bio- 
logical applications; a classical example is regulation of gene expression by transcription 
factors that bind to specific motifs in genomic promoter regions (Maclsaac and Fraenkel, 
2006; GuhaThakurta, 2006; Sandve and Drablos, 2006). For this reason, we refer to the 
positions where the motifs are realized in the sequences as "binding sites". 
£N) ■ Due to its importance, hundreds of procedures have been proposed for motif discovery 

(Hu et al., 2005; Tompa et al., 2005). While some approaches seek to characterize motifs 
and their binding sites using dictionary methods that capture over-representation of words 
as evidence (Regnier and Denise, 2004; Pavesi et al., 2004), it is common to represent motif 
$_i , compositions by a position weight matrix (Stormo, 2000) and specify a parametric model 

where sequences are generated conditionally on motif and background compositions and 
binding sites. Binding sites can then be regarded as missing data; parameters for the com- 
positions can be estimated using expectation- maximization (Dempster et al., 1977) in a 
frequentist setup, as in MEME (Bailey and Elkan, 1995), or assigned a prior distribution in 
a Bayesian setup (Lawrence et al., 1993; Neuwald et al., 1995). 

Following the Bayesian model from (Liu et al., 1995), we assume that there is only one 
motif of fixed length L and that sequences are generated conditionally independently accord- 
ing to a product multinomial model given binding site positions and motif and background 
compositions. Thus, for an alphabet S, we define 0q = (#o,s)se<s as background probabilities 
of generating each letter in S and, for each position i = 1, . . . , L in the motif, 0i = {9i )S ) s ^s 
as the probabilities of generating each letter at the i-th position in the motif. To simplify 
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the notation we denote O = (6>o,#i, ■ ■ ■ ,0l)- As in (Liu et al., 1995), we set a conjugate 
Dirichlet prior for 0. 

Product multinomial and product Dirichlet models are justified as a good working, first 
approximation based on position independence. There are many extensions to this model 
that consider DNA strand complementarity (Roth et al., 1998), a more informative Markov 
structure for the background composition (Liu et al., 2001), and an explicit representation 
of the number of binding sites per sequence (Thijs et al., 2002). However, since we will be 
discussing a new inferential procedure, we adopt an extended model that yields a feasible 
computational method while still retaining a realistic interpretation and allows us to focus 
the discussion on the proposed estimator. 

Motif discovery is considered a hard problem since motifs are usually short relative to 
sequence length and have a composition that might be hard to distinguish from background 
(see, for instance, (Hu et al., 2005).) It is then imperative to rely on more refined, infor- 
mative estimation methods that better glean information from the posterior distribution of 
binding site configurations. Discrete inferential methods with this goal have recently been 
proposed, including the median probability model of Barbieri and Berger (2004) and the 
centroid estimator (Ding et al., 2005; Carvalho and Lawrence, 2008). Centroid estimation, 
in particular, has been successfully used for motif discovery (Thompson et al., 2007), in- 
cluding models that account for sequence conservation (Newberg et al., 2007). 

In this paper we present a Bayesian model for motif discovery on multiple sequences with 
multiple possible binding sites and formalize a new flavor of inference based on centroid 
estimation. As we will argue, the proposed estimator offers a good representative of the 
posterior space of binding site configurations; moreover, as a by-product of its derivation, we 
obtain informative summaries of the distribution of posterior mass. We start the discussion 
by addressing a simple case when there is only one sequence and we accept only one binding 
site; next we extend the presentation to include multiple binding sites; then, we treat the 
full case when is random, in a fully Bayesian approach. Finally, we offer some concluding 
remarks and directions for future work in the last section. 

2. One sequence, one binding site. Suppose we observe a sequence R, \R\ — ^> 
and wish to infer the location of the only binding site Y, Y G {l,...,n — L + l}. Setting a 
non-informative prior on Y, F(Y) = (n — L + 1) , we have the posterior: 



"(Y \R,Q) 



"(R | Y, 0)P(Y | 0) ¥(R | Y, 0) 



Ef~Ji +1 v(R | y, 0)p(y | ©) PGR \Y,Q)' 



The likelihood, as previously stated, follows a product multinomial distribution given Y: 

L 



3r 



m\Y,e)=n n c =s) n 

seSjeBG j=i 

where j & BG means position j in background. 
One traditional estimator is the MAP estimator, 

Y M = ^argmax ¥(Y\R,Q), 
Y=l,...,n-L+1 



but we argue for an estimator that accounts for differences in positions when comparing 
binding site configurations. Using Bayesian decision theory (Berger, 1985) we look for an 
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estimator that minimizes, on average, a more refined loss function H: 
(1) Y C = argmin E Y \ r^q [H(Y, Y)] . 

Y=l,...,n-L+l 

We adopt a generalized Hamming loss H , 

H(Y,Y)=J2Kk(Y),k(Y)), 

i=l 

where h(Y) returns the "state" of position i: if i is a background position, h(Y) = 0, 
otherwise hiY) = Y — i + 1, that is, li(Y) returns the position in the motif. Loss function H 
compares configurations position-wise according to h, which in turn compares states. One 
option for h when is known is a probability distance, the symmetric Kullback-Leibler 
distance, 

h(i,j) = D KL (9i \\6 j ) + D KL {8 3 || 6 t ) = 0i,s log j± + 9 j>s log ^, 
for i, j = 0, 1,. . . ,L. 

It is, however, not common to have such an informed loss function. An alternative metric 
arises by simply allowing 0j s = 9 S ^ #o,s f° r all s G 5 and j = 1, . . . , L in the motif. In this 
case, if m(i) = I(i > 0) indicates if state i is a motif state, 




Since we are ultimately concerned with the argument of a minimum, as per Equation 1, 
we can define the loss function up to a shift and (positive) scale. Thus, for our inferential 
purposes it suffices to define h(i,j) = I(m(i) ^ to obtain a loss H that accounts 

for overlap in binding sites. Such metric is commonly adopted to measure binding site 
level accuracy, as in the performance coefficients in (Pevzner et al., 2000; Hu et al., 2005; 
Tompa et al., 2005). From now on we will be focusing on this minimally informed loss 
function. 

Estimator Yc is a generalized centroid estimator] for instance, if h is a common zero-one 
loss, h(i,j) = I(i 7^ j), H corresponds to Hamming loss, and thus Yc is the regular centroid 
estimator (Ding et al., 2005; Carvalho and Lawrence, 2008). As Carvalho and Lawrence 
(2008) argue, centroid estimators more effectively represent the space since they are closer 
to posterior means; in contrast, it can be shown that Ym arises from a zero-one loss function 
which yields the posterior mode (Besag, 1986). 

Let us now derive more specific expressions for H and Yc- We first notice that if \Y— Y\ > 
L then the binding sites do not overlap and so H(Y,Y) = 2^/^ =1 h(j,0) = H* , the null 

overlap distance between two configurations. Alternatively, when \Y — Y\ < L then 
\Y-Y\ L L-\Y-Y\ 

(2) H(Y,Y)= h(j,0)+ J2 h (j,0)+ Kj,j + \Y-Y\), 

i =1 j=L-\Y-Y\+l i =1 

since the common backgrounds in Y and Y do not affect H(Y, Y), the first two terms above 
account for the left and right "tails" where binding sites in one sequence are matched with 
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background in the other sequence, and the last term accounts for the overlap in binding 
sites. We also note that H(Y,Y) is actually a function of \Y — Y\. 

Instead of a loss function we can also define our estimator in terms of a gain function 
G{Y,Y) = 1 - H(Y,Y)/H*. Note that < G(Y,Y) < 1; in particular, when \Y — Y\ > L 
there is no gain, G(Y, Y) = 0, and if Y = Y we have G(Y , Y) = 1. As a consequence, we can 
simply write G(Y,Y) = I(\Y -Y\ < L)(l -H(Y, Y)/H*) with H from Equation 2. Noting 
that G, like H, is also a function of \Y — Y\, we obtain the following characterization: 

Theorem 1. The centroid estimator Yc is 

f c = ^ argmax G(Y, •) *P(- 1 R, 9), 

y=l,...,n-L+l 

a convolution between G and the posterior distribution on Y . 

PROOF. The result follows directly from the definition in Equation 1: 
Y c = argmin E Y]Rt@ [H(Y,Y)] 

Y=l,...,n-L+1 

= argmax E Y | R)0 [l{\Y - Y\ < L){1 - H(Y, Y)/H*)] 

Y=l,...,n-L+1 

min{n-L+l,y+L-l} 

= ^ arg max V G(Y, Y)P(Y | R, 9) 

y=l,...,n-L+l r=max { li y_ i+1 } 

= argmax G(Y, •) * P(- | R, 9), 

y=l,...,n-L+l 

as required. □ 

When contrasted to Ym we can see the effect of having a higher resolution loss function: Yc 
gathers probability support from nearby, relative to H, binding site configurations instead 
of just picking the most likely configuration. The following example should give us some 
insight into this new estimator. 

Example 1. Consider the following sequence of length n = 200 from the nucleotide 
alphabet S = {A, C, G, T}, 

10 20 30 40 50 

I I I I I 

GCCACTTTCGGGCCCGTGTCTAACGCACCACGGGCTACGTGACGGTGTGG 
CTCTATACTGACGACGTGAACCAAGCTTTACTGAAGGACTTGCTGTTCCC 
CGACCCATTTCCTGCCAGAACCTCTGACCAGTGTCTAGGGCTATCGCCCG 
TGATGTCTCATGGCGACGCGCGAGGCGGTTGCTCGCCTCACTCCGTTCTG 



and a motif of length L = 6 with parameters 9 given by Table 1. 

Figure 1 shows the conditional marginal posterior P(Y~ | R, 9) and the convolution G * 
P(- | R, 9) used to obtain the centroid Yc = 36, binding at the subsequence TACGTG, close 
to the consensual motif. Note that since 9 is very informative the posterior profile has clear 
peaks and in this case Y c = Ym, the two estimators coincide. 
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Table 1 

Background and motif compositions: background is assumed to be CG-rich, while the motif represents a 
canonical palindromic E-box, CACGTG (Murrea et at, 1989). 
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Fig 1. Conditional marginal probability distribution P(Y | R, O) in solid line and convolution G * P(- | R, O) 
in dotted line. The black thick line close to the axis marks the binding site corresponding to the centroid Yc ■ 



3. One sequence, multiple binding sites. We now allow for multiple binding sites 
by denning Y = as the collection of binding sites Y^. The likelihood is similar, but 

accounts for the multiple binding sites: 

\Y\ L 

f(r \y,g) = n n c* =s) n n ^ +i - i=s) 



l,S 



seSieBG k=li=l 



Given the "entropic" effect of possibly having many binding sites, we need to adopt a 
better prior for Y that takes into account the number of possible configurations for the 
binding sites. So, instead of naively electing P(Y) oc 1, we can explore a hierarchical struc- 
ture: if c(Y) = \Y\, the number of binding sites in Y, we note that ¥(Y) = ¥(Y,c(Y)) = 
P(y | c(Y))¥(c(Y)) and set P(Y | c{Y)) oc 1 and P(c(Y)) oc 1 to obtain 



F(Y) = F(Y | c(F))P(c(y)) 



n-c(Y)(L- 1 
c(Y) ) ' C' 
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where C = [n /L\ is the maximum number of binding sites in R. 

Another, possibly more familiar, approach is to adopt a Markov chain with two states, 
background and motif, where the probability of transitioning to background, either from 
background or motif, and of starting at background is p. In this case we keep F(Y \ c(Y)) 
as before, but now 

(3) P(c(Y)) oc " C ™ L " - P )< Y \ 

since there needs to be c(Y) transitions to the motif state. This prior structure offers more 
flexibility through p: we can further set a hyperprior distribution on p, or specify it directly 
based on the expected number b of binding sites in the sequence; if n is large compared to 6, 
as usual, then p should be close to one, c(Y) is approximately Poisson with mean n(l — p) 
and thus p = 1 — b/n becomes a good candidate. 
The posterior is then 

HY\R,e)- wy|e) 



Z Y nR,Y\e) 

¥(R,Y\Q) ^Y:c(Y)=c(Y) ¥ ( R ^\®) 



Ey :c( y )=c{ y) W, Y I ©) Ec=0 £y :c( y )=c P(*, Y | 9) 

V v ' ' v ' 

¥(Y | c(Y),R,&) ¥(c(Y)\R,S) 

By the structure of our prior it follows that 

P(Y | c { Y),R, 6) = nR\Y,eMY) 

E y:c( y )=c(y) n^l^e)p(y) 

^ ' _ F(R | Y, 6) 

"Ey :c( y )=c(y) n«l^e)' 

and 

z Y . AY)=c(Y) nR\Y,®)nY) 



EloE Y : C{ Y)= c nR\Y,e)F(Y) 

(5) „ 

T,y:c { y)=c(y) F (R I y, e)P(y I c(r))P(c(y)) 



" Eto Ey :c( y )=c I e)P(y | c (y ))P( c (y )) ' 

This decomposition suggests a good approach to sampling from F(Y \ R,Q): we first 
sample c(Y) according to P(c(y) | R, O) and then sample Y given the number of binding 
sites, according to P(y | c(Y), R, 0). 

As we will see next, we need to work more to obtain a centroid estimator for the bind- 
ing sites: we need to establish a hierarchical inferential structure by first finding centroids 
for c(y) = 1, . . . ,C and then proceed to estimate a global centroid. To this end we find 
P(c(y) | R, 0) and then compute marginal posteriors P(Yfc | c(Y),R, O). 

3.1. Marginal posterior on c(Y). From Equations 4 and 5 we observe that we need to 
compute Ey. c (y) =c P(-R | y, ©) up to a constant to find both conditional posteriors of c(Y) 
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and Y and thus the posterior P(Y \R,Q). Let us now denote by Ri j the subsequence of R 
from positions i to j and by Yj.j the binding sites in Y between i and j — that is, all Yfc such 
that i < Yk < j — L + 1 . If we then define forward sums 



(6) F t 



1JU=1 IJLeS ^O.s 

we have that Sy-^y)— c ^(-R I ^> ®) ^ ^c,n- To further simplify the notation, let us define 

£ / /! \ ^(^-i+i=s) 



A0;e) = nil 



i=l seS 



"%.S 

?0,s 



the composition ratio between motif and background for a binding site starting at j. 
The forward sums F c j can be computed recursively, 

(7) Fcj = F C j_i + Fc-ij-xAC; — L + 1; G), 

by considering two options for the tail of the sequence: either having a background position — 
and hence the first summand above — or by having a binding site on the last L positions — and 
thus requiring the second summand. 
Thus, we have 

, , x F^r^-^y'nciY)) 

(8) P(c(T) | R, 0) - 



which yields a straightforward way to sample the posterior c(Y) conditional on 0. 

3.2. Marginal posterior on Yj- given ciY). To compute P(Yfc | c(Y),-R, 0) we now need 
backward sums. We can define them analogously to the forward sums: 



(9) B, 



^>Y j:n :c{Y j:n )=c ^( R j-n I Yj-.n, ©) 
C,J ~ Tin T-r nI{Rr=s) 

lli=j Uses a o,s 



and hence ,y-s =c P(i? | Y, 0) oc -B c ,ii as expected. Moreover, by a similar argument to 
the previous subsection, we also have that the backward sums are recursive: 

(10) B c>j = Bcj+t + S c _i )i+L A(j; 0). 

Having forward and backward sums enable us to readily compute the marginal posterior 
on Yk conditional on c(Y): since 

F(Y k \c(Y) = c,R,e)= J2 nY\c(Y) = c,R,e) 

Yi,...,Y k _ 1 ,Y k+1 ,...,Y c 

f ( r i y > ©) 



Yi,...,Y k _ u Y k+1 ,...,Y c T,Y:c(Y)=c F ( R I F ' ) 
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52 F(R\Y,e)= Yl HRi:Y h -i\Y 1:Yh -i,e) 

Y u ...,Yk-i,Y k+1 ,...,Yc n,...,Yfc-i 

• nRY k :Y k+ L-l | Y Yh V k+L -l,Q) ■ 52 F ( R Y k +L:n I Y Yk+L:n , Q), 

Y k+1 ,-,Y C 

and thus 

(id m i C (Y) = c, r, g) = (c X^r lA(n;e) ^" fc,y ; +L • 

Note that 

E Yc{Y)=c nR \Y,e) 
nr=i n se s 



/(i?i=s) 



n-(c-fc+l)L+l 

E F k _ 1 Y k _ 1 x(Y k -,e)B c __ k ^ +L , 

Y k =(k-1)L 

for fc = 1, . . . , c. 

Before discussing posterior inference we summarize the results of this section in Algo- 
rithm 1. 

Algorithm 1 Computes F(c(Y) \ R, G) and ¥(Y k \ c(Y), R, Q) for k = 1, ... , c(Y). 

Step 1. (Initialize) Set Fb,o = Bo, n +i = -Foj = Bqj = 1 for j = 1, . . . , n; for c = 1, . . . , C, set F c ,j = when 
j < cL and B c j = when j > n — cL + 1. 

Step 2. f Compute forward sums) For c = 1, . . . , C and j = cL + 1, . . . , n do: set F c j as in Equation 7, 

Fc.j = Fcj-t + Fc-u-LXij - L + 1; 0) 
Step 3. f Compute P(c(Y) i?, 0)j For c = 0, . . . , C do: compute marginal posterior c(Y) as in Equation 8, 

P(c(Y) = c LR,0) = 22i ^ * " 

Eto^r^" 15 ) "V(c(Y)=c) 

Step 4. (Compute backward sums) For c = 1, . . . , C and j = n — cL, . . . , 1 do: set _B c j as in Equation 10, 

Sc,j = B c ,j+i + B c -i j+lA(j; O) 

Step 5. (Compute P(Y fe | c(Y), i?, 0),) For c = 1, . . . , C, k = 1, . . . , c, and Y k = (k - 1)L + 1, . . . , n - (c- fe + 
1)L + 1 do: compute marginal posterior Y k given c(Y) as in Equation 11, 

P(Y fc | c(Y) = c,i?,e) = F k - 1 ,Y k -iX(Y k -e)B c - k ,Y k +L/F c , n 



3.3. Posterior Inference. In contrast to the one binding site case from last section, pos- 
terior inference is more difficult since comparing configurations with different number of 
binding sites is not amenable to a systematic approach. Our first approximation is to con- 
sider local estimators for each group of configurations with a fixed number of binding sites 
and then appeal to a triangle inequality: 



H(Y,Y)<H(Y,Y C ) + H(Y C ,Y), 
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where Y is a configuration with c binding sites, Y c is the constrained estimator for all 
configurations with c binding sites, and Y is the (overall) centroid estimator. Recall that 
for the centroid estimator we wish to find Y that minimizes 

c 

K YlR)e [H(Y,Y)]=J2 E H(Y,Y)¥(Y\R,e). 

c=0 Y:c(Y)=c 

Using the triangle inequality for each group we then have 

c 

(12) E Y]Rje [H(Y,Y)]<J2 E [H(Y,Y C ) + H(Y C ,Y)]F(Y\R,Q) 

c=0 Y:c(Y)=c 

C 



c=0 



H(Y,Y C )+ E H(Y c ,Y)P(Y\c(Y) = c,R,e) 

Y:c(Y)=c 



p(c(y) = C\R,&), 



where Y c is an arbitrary point in {Y : c(Y) = c}. Our task is now to find an estimator — let 
us still call it centroid — that minimizes the right-hand bound in Equation 12 above. This 
goal suggests a two-step strategy: 

1. For each number of binding sites, c = 1, . . . , C, find the local centroids 

(13) Y c = arg min E y , c(y)=Cji? ,e [H{Y, Y)] 

Y:c(Y)=c 

as the Y c in Equation 12. 

2. Find the global centroid given the local centroids {^c}£Li) 

(14) Y = arg min E c(Y) , R>e [H(Y c{Y) , Y)] . 

Y 

We note that this strategy does not guarantee that the bound is minimized; the main 
goal here is computational convenience. Let us tackle each step of this heuristic next. 

3.3.1. Local centroids. Even when the number of binding sites is fixed, minimizing the 
conditional posterior expectation of H(Y, Y) can be challenging: we would still have to 
consider for each candidate configuration Y the posterior probability of configurations with 
all binding sites to the left of the first binding site in Y, in-between binding sites in Y, and 
so on. We adopt another approximation and decide to minimize a paired Hamming loss Ha 
where binding site positions are matched according to their order: 

<Y) 

H A (Y,Y) = Y,Hi(Y k ,Y k ), 

k=l 

where iii(Yfc, Y k ) is Hamming loss when comparing sequences with only one binding site at 
Yj, and Y k , respectively, that is, H\{Yk, Y k ) = 2max{|Y/% — Y k \,L}. From the definition we 
have that Ha upper bounds H: Ha(Y, Y) > H(Y, Y). As a bad approximation example, if 
Yj, = Yfc+i for k = 1, . . . , c(Y) — 1 then Ha(Y, Y) = c(Y)L, since each pair of binding sites 
Yfc and Y k does not overlap, while H(Y, Y) = 2L since only Y\ and Y c (y) are in disagreement 
with background. 

The next result adapts Theorem 1 to yield the paired local centroids. 
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Lemma 2. If Pfc(- 1 c{Y) = c,R,&) is the marginal conditional posterior on Y k then the 
•paired local centroids are 

c 

Y c = arg maxj^ G(Y k , ■) *P*(- I c(F) = c, -R, G) 
y :c (Y)= c fe= i 

Proof. In the same spirit of Theorem 1, we use the conditional estimator in Equation 13 
with the paired loss Ha- 

Y c = arg min E Y \ c (Y)=c,R,e [ h a(Y, Y)] 

Y:c(Y)=c 

c 

= argmin ]T ^ H\(Y k , Y k )¥(Y \ c(Y) = c, R, 6) 

Y:c(Y)=c Y :c{Y)=c k=l 

c n-(c-k+l)L+l 

= argmin ^ ^ H x {Y k , Y k )F(Y k \ c(Y) = c, R, 9) 

Y:c(Y)=c k= i Yk = (k-1)L+1 

c min{n-(c-k+l)L+l,Y k +L} 

= argmax^ £ G(Y k , Y k )¥(Y k \ c(Y) = c, R, 9) 

y :C (y)=c fc= i y fc=max {( fc „ 1 ) L+li y fc „ L } 

c 

= arg max ]T G(Y k , •) * P fc (- | c(Y) = c, R, 9), 

Y:c(Y)=c k=1 

and the result follows. □ 

We can spot in Lemma 2 the familiar convolutions, but now with the marginal posteriors 
P(Yfc | c(Y),R, 9) and in a more restricted range. We have a nice characterization, but we 
still have to optimize a sum to obtain the local centroids; to this end we explore the same 
recursive structure that allowed us to compute forward and backward sums. Let us define 
f(Y k ) = G(Y k , •) * Pfc(- 1 c{Y) = c, R, 9) as the convolution against the marginal posterior 
on Y k ; then we should have 



c-l 

(15) _ m § x / f(Yk)= max 

Y:c(Y)=cfr[ Y c =(c-l)L+l,...,n-cL+l 



/(F c ) + ^max 

Y^Yo-i 



This important observation allows us to obtain Y c using the dynamic programming approach 
listed in Algorithm 2, as Theorem 3 formalizes. 

Theorem 3. Algorithm 2 correctly identifies the paired local centroids 

Y c = arg min E Y \ c ( Y )=c,R,e [ h a{Y, Y)] ■ 

Y:c(Y)=c 

Proof. From Lemma 2 we know that Y c is the argument of maxp. c( .y^ =c Ylk=i f(Y k ). The 

key device in Algorithm 2 is to exploit the recursion in Equation 15 to define m\{Yi) = f(Y{) 
and 



(16) 



m k (Y k ) = f(Y k ) + max "ifc-ift-i), 

Y k _ 1 = (k-2)L+l,...,Y k -L 
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Algorithm 2 Find Y c using dynamic programming. 

Construct partial maxima and backtrack pointers: 
Step 1. Set mi(Yi) = /(Yi) for Yi = 1, . . . , n - cL + 1. 

Step 2. For k = 2, . . . , c and Yt, = (k — 1)L + 1, . . . , n — (c — k + 1)£ + 1 do: set backtrack pointers 

A-i(Y fc ) = _ argmax m fc _i(Y fe _i). 

Y fc _ 1 =(fc-2)L+l,...,y fc -Z, 

and set partial sum maximum rrik as 

m k {Y k ) = f{Y k ) +m k - 1 (A k -i{Y k j). 

Reconstruct centroid Y c using backtrack pointers: 
Step 3. Set last binding site position: 

Y CjC = _ argmax m c (Y c ). 

Y c = (c-l)L+l,...,n-L+l 

Note that, by construction, maxy :c( y )=c ^k=i = m c (Y C;C ). 

Step 4. For k = c, . . . , 2 do: recover the remainder of Y c by setting Y Ct k-i — Ak-i(Y c , k ). 



for k > 1, to store partial sum maxima. Now it follows that 

c 

_max V/(Yfc) = _ max m c (Y c ), 

Y:c{Y)=c'jZ~ 1 Y c =(c-l)L+l,...,n-cL+l 

and so Step 3 must be correct. The correctness of Step 4 relies on the right specification of 
m in Steps 1 and 2; but these steps are a straightforward application of Equation 15 using 
the definition of mi and a formulation of Equation 16 based on the backtrack pointers A, 
and so the algorithm is correct. □ 

We note that the paired local centroids minimize an expected posterior upper bound Ha 
on the loss H , and so the actual local centroid might not be attained. We expect, however, 
that for common cases in which the motif coverage c(Y)L is much smaller than n that 
the bound is tight since Ha approximates H well and thus the two local centroids often 
coincide. 

3.3.2. Global centroid. While the local centroids already convey information about the 
distribution of posterior mass in the space of binding site configurations, the end goal 
of the analysis is a point estimate that is, in itself, a good representative of the space. 
Following the strategy we outlined in the beginning of this section, we can further summarize 
the information in the local centroids by identifying a configuration Y that minimizes the 
expected conditional Hamming loss, as in Equation 14. This approach, however, entails the 
same difficulties as defining the centroid based on all points in the space, and it is thus not 
treatable by a systematic approach — we are now just restricting the configurations to the 
local centroids. 

The global centroid can be defined by direct enumeration of all possible configurations 
while keeping the minimizer of the expected conditional posterior loss, but this "brute- force" 
approach considers an exponential number of solutions. A simple heuristic is to restrict the 
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global centroid to be one of the local centroids, 

(17) Y = argmin E c(y) , R)Q [H(Y c{Y) ,Y)] . 

Another alternative is to just take as global centroid the local centroid of the modal number 
of binding sites, Y = Y c *, where c* = argmaxP(c(Y) = c | R, 0). From now on we adopt 

c=0,...,C 

the global centroid in Equation 17 for simplicity and, again, computational expediency. 

Before we continue to our next example, let us remark that a constrained, on the number of 
binding sites, global centroid might be more computationally feasible since we are restricting 
the space of available configurations. For instance, consider the 1-global centroid, 

Y = argmin E y , r Jh(Y,Y) 
Y:c(Y)=l 

As when defining local centroids, we can approximate Y Q using a paired loss, and since 

C 



E 



y\r,& 



h a (y,y)]=Y^ E £#i(i r ,n)P(y|ii,e) 

c=0Y:c(Y)=ck=l 
n C c 

= EE E E^'W^i^ 6 ) 

j=l c =0 Y:c(Y)=ck=l 
n C c 

= E^i(?^)E E E p ( F fc = i i^ e ) 

i=l c=0y :c (Y)=cfc=l 



it 



E ^(y.O-PcCtl-R.e), 



i=l 



where 



O c 

(is) p c (i|i?,©)=E E x;nn=ii^©), 

c=l Y:c(Y)=ck=l 



we have that 



y o = arg min E y | H 

Y:c(Y)=l 



H A (Y,Y) =&rgmaxG(Y,-)*P c (-\R,@). 

Y:c(Y)=l 



It is important to note that while the restriction of one binding site might seem artificial, 
the derivation of Y is helpful in recognizing sequence regions that are likely to host binding 
sites. In fact, since P c captures the posterior probability of having a binding site starting at 
each position, and considering the overlap gain G, the convolution of G and P c highlights 
positions that have higher posterior probability of being covered by a binding site. 



Example 2. We revisit the same sequence from Example 1, but now allow for at most 
C = [n/L\ = 33 binding sites, and adopt the prior given in Equation 3 with 6 = 3 
and thus p = 1 — b/n = 0.985. Using Algorithm 1 we are able to compute the conditional 
marginal posteriors P(c(Y) | R, 0) and P(lfe | c(Y), R, 0) for k = 1, . . . , c(Y). These posterior 
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distributions yield the local centroids — according to Algorithm 2 — and the global centroid 
from Equation 17. In Table 2 we list the marginal posterior ¥(c(Y) = c | R, 0) up to the 
smallest c such that P(c(Y~) < c \ R, 0) > 0.95, along with the local centroids; the global 
centroid Yq is highlighted. Interestingly, the global centroid coincides with the local centroid 
from the modal number of binding sites. 



Table 2 

Centroids and marginal posterior distribution of number of binding sites. The global centroid and the modal 

number of binding sites are highlighted in bold. 



c 


Y c 


P(c(Y) = c\R,Q) 


P(c(Y) <c\R,Q) 







0.014 


0.014 


1 


36 


0.075 


0.089 


2 


36, 147 


0.181 


0.270 


3 


13,36, 147 


0.254 


0.524 


4 


13, 36, 63, 147 


0.233 


0.757 


5 


13,36,63, 147,167 


0.147 


0.904 


6 


3,29,36,63, 147, 167 


0.067 


0.971 



In Figure 2 we display the posterior probabilities of binding site coverage P c from Equa- 
tion 18, along with the convolutions that are needed to define the 1-global centroid Y Q = 36. 
As can be seen, position 36 has a lot of support, being present in all the local centroids 
listed in Table 2; in fact, the probability of a binding site starting at position 36 is greater 
than 50%. 















— Pc(UR) 














G * P c (i | R) 
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Fig 2. Posterior binding site coverage P c in solid line and convolution G * P c in dotted line. Local centroids 
are listed below in gray; the global centroid is in black. 

While P c can provide us guidance for which positions are likely to start a binding site, 
using P c to define local centroids can be misleading. For instance, we could expect that 
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the local centroid with three binding sites — the modal number of binding sites — would be, 
following a decreasing order on P c , 36, 63, and 147. However, if we examine the marginal 
posteriors P(lfc | c(Y) = 3, R, 0) in Figure 3 we realize that position 13 is favored over 
position 63 because, if F k = G*F k {-\c(Y) = 3,i?,0), Fi(13) + F 2 (36) > iq(36) + F 2 (63). 



CM 

b 



b 



o 
b 



o 
b 



LA Ik A AT 



...Ma :M.ii..A . .Uj 1% 




^ -MiA; 



i 

100 
Position 



50 
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200 



Fig 3. Marginal posterior distributions W(Yk\c(Y) = 3, R, O) in solid line and convolutions G * 
P(- |c(Y),7i, O) in dotted line. The local centroid is displayed at the bottom. 



4. Multiple sequences, multiple binding sites per sequence, random motif. 

We are ready to address our model in broader generality: the dataset now comprises m 
sequences, R = {i?.;}"^, and thus binding site configurations are also indexed by sequence, 
Y = {Yi} r £ =l . As before, we have that Y is independent of motif parameters 0, but we 
further assume that sequences and configurations are conditionally independent given 0: 

m m 

(19) F(R,Y \ Q) = ~[[F(R t ,Y l \e) = ~[[F(R l \Y i ,e)F(Y i ). 

i=l i=l 

Given we would be able to apply the methods discussed this far to each sequence 
separately: compute forward and backward sums to obtain marginal posterior probabilities 
for each Yi and then find local centroids and the i-th global centroid. We will, however, 
assume that is random, 

(20) ^~Dir(a,), j = 0,l,...,L, 

independently, and we thus wish to also conduct inference on the background and motif com- 
positions. This assumption, albeit more realistic, complicates matters, since the marginal 
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unconditioned posterior distributions of Y and G are not readily available; we are now re- 
quired to estimate them before obtaining centroid estimates. To this end, we present next 
a Gibbs sampler (Geman and Geman, 1984; Liu, 2008) that draws Yi for each sequence 
given and then samples conditional on the binding site configurations Y, similar to the 
approach in (Liu et al., 1995). 

4.1. Sampling given Y and R. Since the prior on is conjugate, we should be able 
to sample exactly from a Dirichlet distribution. Prom Equations 19 and 20 we have 



wm* nnn C i=s) ncr 1 

_i=ls£Sj€BGi J LsSS 

n fl E™i EjesG; I{Rij=s)+a , 3 -l 

s<=S 

and so 9 | Y, R ~ Dir(iY (Y, R) + a ), where N (Y, R) = {N , s } s es and 



i=l j'eBGi 

is the number of background positions across all sequences that have symbol s. Similarly, 
for the j-th position in the motif, 



F(9j \Y,R)<x 



nnn« 

i=lseSk=l 



I(Ri,Y ik +j-l—s) 
3,t> 



n 



3>s 



n^ETL i TtjeBGi I(Rij=s)+cto, s -l 



and thus 9j \Y,R~ Dir(A^(F, R) + aj ), with Nj(Y, R) = {N jjS } seS and 



Nj,s 



8=1 fc = l 



is the number of motif j-th positions across all sequences and binding sites that have symbol 
s. 



4.2. Sampling Yi given and R. Each configuration Yi for the i-th sequence is condi- 
tionally independent given 0, so we can devise a sampling procedure that can be applied 
to each sequence in turn. To simplify the notation, let us drop the sequence index in what 
follows, that is, Yi is Y, R4 is R, and so on. We will be following a similar approach to 
Sections 3.1 and 3.2, but instead of summing to obtain marginal distributions we will be 
sampling exactly. 

To sample from the conditional posterior on Y, we first sample c(Y) = c according to 
Equation 8 and then proceed to sample Y from its last, c-th binding site up to its first bind- 
ing site. For this reason, this strategy is commonly referred to as "stochastic backtracking", 
since it can be regarded as a stochastic version of Step 4 in Algorithms 1 and 2. Sam- 
pling Y is similar to the predictive update step in (Liu et al., 1995), which, on its turn, is 
based on a stochastic variation of expectation-maximization where missing data is imputed 
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(Tanner and Wong, 1987); however, here we exploit a hierarchical structure on c(Y) and do 
not use the collapsing technique of Liu (1994). 

Exploiting the conditional independence of the sequence configurations and Equation 4 
the last binding site can be sampled using 

£ y F(R\Y,0) 

¥(Y c \c(Y),R,@) 



£y c £y 1 ,...,y c _ 1 p ( i? l y > ) 
1 ' F c _ 1>Yc -iX(Y c ;Q) 



To sample the (intermediate) j'-th binding site we use a similar expression: 

¥(Y j) ...,Y c ,c(Y),R,e) 



¥(Y j \Y j+1 ,...,Y c ,c(Y),R,@) 
(22) 



J2y¥(Y j7 ...,Y c ,c(Y),R,e) 

E? J Ey ll ..^_ 1 P(«|y.©) 
.V. ,Ai>):B) 



By making the convention that Y c+ ± = \R\ + 1 we can reduce Equation 21 to Equation 22. 
Moreover, note that Equation 22 implies that 

F{Yj | Y j+1 , . . . , Y e , c(Y),R, G) = F(Yj \ Y j+1 ,c(Y),R, G), 

as expected. 

We summarize the whole procedure in Algorithm 3. Note how Steps 1.1 to 1.3 are anal- 
ogous to Steps 1 to 3 in Algorithm 1, and how Step 1.4 is an stochastic version of Step 4 
in Algorithm 1: as previously stated, we are now sampling backwards instead of summing 
backwards. To obtain the centroids we follow the procedure described in Section 3.3, but 
adopting Monte Carlo estimates of the marginal posterior distributions, for i = 1, . . . ,m, 



1 T 

(c(Y i ) = c\R)^-Y J l(c(Yl t) ) = c), 



t=i 

m k = j i c&i) = c,r)* EL ; = c) , k=i,...,c, 

£f = i/(c(if) = c ) 

where T is the number of samples. 

Example 3. For the random motif version of Example 2 we simulate m = 20 sequences 
of same length n = 200 using G from Table 1 and the prior for Yi, i = l,...,m, from 
Equation 3 with p = 1 — 1/n = 0.995. 

We continue focusing on the inference of binding site configurations in the same sequence 
from previous examples, which is the first sequence in the simulated dataset. We assume a 
non-informative prior on G by setting otj tS = 1 for s 6 S and j = 0, . . . , L; the prior on each 
sequence Yi is the same prior from Example 2 with p = 0.985. Algorithm 3 is run for 10,000 
iterations to guarantee convergence (diagnostics not shown.) 
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Algorithm 3 Gibbs sampler for P(Y, O | R). 



Set arbitrarily. For t = 1, . . . (until convergence) do: 

Step 1. (Sample Y | O, R) For each sequence i = 1, . . . , m, do: let n — \Ri\, C = \njL\ and sample 

Yi\Ri,e. 

Step 1.1. (Initialize) Set Fqj = 1 for j = 0, 1, . . . , n and for c = 1, . . . , C set i^.j = when j < cL. 

Step 1.2. f Compute forward sums) For c = 1, . . . , C and j = cL + 1, . . . , n do: set F c j as in 
Equation 7, 

F cJ = F CJ _! + F c _ M _iAi(j - L + 1; O^ 1 '), 

where Ai uses i?^. 

Step 1.3. (Sample c(Y^') \ R t , O'* -1 '^ For c = 0, . . . , C do: compute marginal posterior c(Yi) as in 
Equation 8 when applied to the i-th sequence, 

ncm = c\Ri, = ^A-^rnf^c) 

and sample c (t) = c(Y i (t) ) according to ¥(c(Yi) = c| R i: 9 (t_1) ). 
Step 1.4. (Sample Y t (t) | c(F i (t) ) = c (t) , R iy 6 ( * _1) j For fc = c (t \ . . . , 1 do: sample Y^ proportional 
to F Y it)_ x M Y ik\ Q {t ' 1] ) as in Equation 22, 

' ik 

nYU\Y$ +1 ,c(Y} t) ) = cV,R i ,& t - 1) ) = 



F- i,fc+1 f, , v -.AifY^ec- 1 )) 



Step 2. (Sample Q\Y,R) For j = 0, . . . , L compute JVj(Y (t) , R) and then sample 
0f> | Y (t \ i? ~ Dir(Ar 3 (Y( t ', R) + a,). 
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Fig 4. Boxplots of MCMC samples for O (outliers are not shown.) 



The marginal posterior distribution of O can be assessed in Figure 4. Since most positions 
in the sequences are background sequences 6$ has very small posterior variances. Also note 
that the canonical palindromic E-box motif, with consensus CACGTG, is recovered. 

The procedure is now similar to what we presented in Example 2; the main difference is 
that the marginal posterior distributions are estimated from the MCMC samples. Table 3 
lists the estimated marginal posterior distribution of the number of binding sites, the local 
and global centroids. The global centroid does not coincide with the local centroid for the 
modal number of binding sites. Moreover, the local centroids here are different from the 
(conditional) local centroids in Example 2, most likely due to the randomness of being 
taken into account. 



Table 3 

Centroids and estimated marginal posterior distribution of number of binding sites. The global centroid and 
the modal number of binding sites are highlighted in bold. 



c 


Y c 


P(c(Y) = c\R,Q) 


P(c(F) < c\R,Q) 







0.026 


0.026 


1 


29 


0.107 


0.133 


2 


29, 167 


0.210 


0.343 


3 


29, 63, 167 


0.274 


0.617 


4 


13,36, 147,167 


0.201 


0.818 


5 


13,29,63,147, 167 


0.120 


0.938 


6 


13, 29,36,63,147, 167 


0.046 


0.984 



Figure 5 displays the estimated P c , G * P c , and the centroids. We see that compared to 
Example 2 some posterior mass has shifted to positions 29 and to the group of positions 
166, 167, and 168. Here we clearly see the advantage of a centroid estimator: G * P c , and 
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later G*¥k(- \ R), gathers evidence of motif binding from nearby positions, yielding a better 
summary — according to our choice of loss function — of the distribution of posterior mass. 




Fig 5. Estimated posterior binding site coverage P c in solid line and convolution G * P c in dotted line. Local 
centroids are listed below in gray; the global centroid is in black. 

The selection of position 167 in the second local centroid Yi might seem puzzling since 
the peaks at positions 36, 63, and 147 hold higher coverage probabilities. Checking P(Yfc | R) 
in Figure 6 helps dismiss any doubts: most of the support for these positions come from 
configurations with higher number of binding sites, as evidenced by the respective local 
centroids, but these configurations hold relatively low posterior mass. When c(Y) = 2, the 
prior on Y2,2 assigns more posterior probability to higher positions, close to the end of the 
sequence, simply because there are more configurations for Y22 on these positions. It is 
also important to notice that while none of the positions in the cluster 166-168 has higher 
marginal posterior mass than positions 63 and 147, the convolution G*¥2(- \ R) is maximized 
at position 167, that is, the cluster when taken together has more support from the data, 
as weighted by G. 

Example 4. We end this section with an example from the real- world dataset in 
(Tompa et al., 2005), sequence set yst02r. The dataset contains m = 4 sequences each 
with n = 500 letters. We set L = 16 and adopt a non-informative prior on O, as in the 
previous example, and the prior on each Yi, for the i-th sequence, from Equation 3 with 
6 = 3 per thousand positions, so p = 1 — 3/1000 = 0.997. As in the previous example, 10,000 
iterations suffice to reach convergence. 

Let us focus on the second sequence. Figure 7 pictures the binding site coverage probabil- 
ities, along with the local centroids. The global centroid Yc = {85, 105, 169} contains three 
binding sites, and it is also the local centroid for the modal number of binding sites, with 
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Fig 6. Estimated marginal posterior distributions P(Yfc | c(Y) = 2, R, O) in solid line and convolutions G * 
P(- | c(Y),R, O) in dotted line. The local centroid is displayed at the bottom. 

P(c(y) = 3 | R) = 0.32. Since most of the posterior mass in concentrated in configurations 
with c(Y) = 3, the posterior profiles P(lfc I c(X) = 3, R) are similar to P c and are thus 
omitted. 

From the MCMC samples we can produce the MAP estimate Ym = {86, 105, 174} as the 
configuration with highest frequency among the samples: P(Ym I R) = 0.032. In fact, we can 
estimate the posterior probability of each sampled binding site configuration and then, using 
classic multidimensional scaling (Gower, 1966), visualize the estimated posterior distribution 
in Figure 8. It is interesting to note that the null configuration — that is, without binding 
sites — is also very likely with posterior probability 0.024. In contrast, the global centroid 
has very small posterior probability, close to 0.001; it sits, however, closer to configurations 
with high posterior mass, including the local centroids with one, two, and four binding sites. 

To better assess how the centroid estimator is closer to a mean than a mode estimator, 
we plot the estimated posterior distribution of the generalized loss function H centered at 
both Y c and Y M in Figure 9. Since E y | R [H(Y M ,Y)} = 42.40 and E Y \ R [H(Y C ,Y)] = 40.22, 
we see that the binding sites in the centroid configuration are, on average, overlapping 
two extra positions with the binding sites in all the configurations when compared to the 
MAP estimate's binding sites. Both estimates are fairly similar, but the centroid reminds 
us that placing the third binding site at position 169, instead of 174, yields an unlikely 
configuration, but with a higher chance of overlapping with binding sites in positions 160- 
175 that have high posterior probability. In the context of Figures 8 and 9, the centroid 
places itself between two clusters that concentrate posterior mass: one with configurations 
Y such that 25 < H(Yc,Y) < 40 and another with configurations further away, satisfying 
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Fig 7. Estimated posterior binding site coverage P c in solid line and convolution G * P c in dotted line for 
real-world dataset, second sequence. Local centroids are listed below in gray; the global centroid is in black. 



40 < H(Y C ,Y) < 50. 

5. Discussion. In this paper we have presented a Bayesian approach, similar to the 
Gibbs motif sampler in (Lawrence et al., 1993; Liu et al., 1995), that jointly models motif 
and background compositions and binding site locations in a set of sequences. More impor- 
tantly, we discuss and formalize an inferential procedure based on the centroid estimator 
proposed by Carvalho and Lawrence (2008). As in any Bayesian analysis, we wish to eval- 
uate features of interest in a model based on their posterior distribution; however, if we are 
required to pick a representative configuration, a point in the parameter space, then a prin- 
cipled approach is to elect a loss function and conduct formal statistical decision analysis. 
In this sense, by exploring a more refined loss function that depends on position-wise com- 
parisons between sequence states — background or motif positions — we are able to identify a 
better representative of the posterior space of binding site configurations. As pointed out in 
(Carvalho and Lawrence, 2008), the centroid estimator better accounts for the distribution 
of posterior mass; it is more similar to a median than to a mode, and can thus offer better 
predictive resolution than the MAP estimator (Barbieri and Berger, 2004). When applied 
to motif discovery, the centroid estimator captures information in the vicinity of binding 
site positions through a convolution in marginal posterior distributions of binding sites. 

Given the combinatorial number of possible configurations in the parameter space it is 
not feasible to identify the centroid estimate through enumeration or even a systematic ap- 
proach. Yet, we devise an approximative scheme that efficiently optimizes an upper bound 
on the posterior expected loss and thus provides a related centroid. Despite its heuristic 
nature, the proposed method has another advantage besides computational convenience: it 
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Fig 8. Estimated posterior distribution of configurations Y based on MCMC samples and projected using 
multidimensional scaling. The colors code configurations with different number of binding sites. Bold points 
mark local centroids, while a square (bold) point highlights the global centroid. 



allows for an informative depiction of the posterior distribution on binding site configura- 
tions. First, when defining the local centroids, we are able to assess the contributions from 
each binding site through their marginal posterior distributions conditional on the number 
of binding sites, and, in particular, through the convolution of these marginal profiles with 
the gain filter; secondly, when finding the global centroid we explore the marginal posterior 
distribution on the number of binding sites. Moreover, other representations might be help- 
ful in understanding the distribution of posterior mass, as in the use of P c (in Equation 18) 
to pinpoint the 1-global centroid and measure the overall support of the configurations to 
a binding site at some specific position in the sequence. These comments are in the spirit 
of an estimator being also a communicator of the posterior space and the particular choice 
of prior distribution (see Berger, 1985, Section 4.10). 

It is important to note that even when the model is accurate, a poor inference might fail 
in recovering relevant features of the space. In Example 2, the MAP estimate is the null 
configuration, while the centroid indicates three binding sites that represent a group of con- 
figurations that jointly pool significant posterior mass. It is also common that the posterior 
distribution is too complex to be reasonably captured by a single representative; in this 
case the expected posterior loss could also be used to partition the space and further define 
additional representatives as conditional estimates on each subspace. This is a direction of 
work that warrants interest and that we intend to follow next. 

Further improvements can be obtained by specifying a more complex model that accounts, 
for example, for higher order Markov chains with more states for the background, as in 
(Roth et al., 1998; Liu et al., 2001), phylogenetic profiles (Newberg et al., 2007), structural 
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Fig 9. Estimated posterior distribution of loss function centered at Y for the MAP (Y 
(Y = Yc ) estimates. 



Ym ) and centroid 



information (Xing and Karp, 2004), a variable motif length, or dependency among motif 
positions. As pointed out by Hu et al. (2005), motif discovery using sequence only is well 
known for low signal-to-noise ratio; future extensions would also incorporate other data 
sources, such as gene expression or ChlP-Seq data, to increase the signal-to-noise ratio. 
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